Abstract

We address the task of predicting pose for objects of
unannotated object categories from a small seed set of
annotated object classes. We present a generalized classifier
that can reliably induce pose given a single instance of a
novel category. When a large collection of novel instances
is available, our approach then jointly reasons over
all instances to improve the initial estimates. We empirically
validate the various components of our algorithm and
quantitatively show that our method produces reliable pose
estimates. We also show qualitative results on a diverse set
of classes and further demonstrate the applicability of our
system for learning shape models of novel object classes.

1. Introduction 

Class-based processing significantly simplifies tasks
such as object segmentation [17, 4], reconstruction [6, 21,
38] and, more generally, the propagation of knowledge from
objects of classes we have seen before to those we are seeing
for the first time. Looking at the lion in Figure 1, humans
can not only easily perceive its shape, but also tell that it is
strong and dangerous, estimate its weight and dimensions,
and even approximate its age and gender. We know all of
this because it is a lion like others we have seen before
and that we know many facts about.

Despite its many virtues, class-based processing does not
scale well. Learning predictors for all variables of interest -
figure-ground segmentation, pose, shape - requires expensive
manual annotations to be collected for at least dozens
of examples per class and there are millions of classes.
Consider again Figure 1 but now look at object A. The
underlying structure in our visual world allows us to perceive a
rich representation of this object despite encountering it for
the first time. We can infer that it is probably hair that
covers its surfaces - we have seen plenty of hair-like materials
before - and that it has parts, and we can determine their
configuration by analogy with our own parts or with those of
other animals. We are able to achieve this remarkable feat by
leveraging commonalities across object categories via
generalizable abstractions - not only can we perceive that all the
other animals in Figure 1 are “right-facing”, we can also
transfer this notion to object A. This type of cross-category
knowledge transfer has been successfully demonstrated
before for properties such as materials [37, 8], parts [35, 10]
and attributes [22, 13].

Our implementations and trained models are available at
https://github.com/shubhtuls/poseinduction

Figure 1. Inductive pose inference for novel objects. Right: Novel
object A. Left: instances from previously seen classes having
similar pose as object A.

In this paper we define and attack the problem of
predicting object poses across categories - we call this pose
induction. The first step of our approach, as highlighted
in Figure 2, is to learn a generalizable pose prediction
system from the given set of annotated object categories. Our
main intuition is that most objects have appearance and
shape traits that can be associated with a generalized
notion of pose. For example, the sentences “I am in front of
a car” or “in front of a bus” or “in front of a lion” are clear
about where “I” am with respect to those objects. The
reason for this may be that there is something generic in the
way “frontality” manifests itself visually across different
object classes - e.g. “fronts” usually exhibit an axis of bilateral
symmetry. Pushing this observation further leads to our
solution: to align all the objects in a small seed set of classes,
by endowing them with a set of 3D rotations in a consistent
reference frame, and then to train pose predictors that
generalize in a meaningful way to novel object classes.

This idea expands the current range of inferences that
can be performed in a class-independent manner and allows
us to reason about pose for every object without tediously
collecting pose annotations. Such pose-based reasoning can
then inform a system about which directions objects are
most likely to move in (usually “front” or “back”) and hence
allow it to get out of their way; it can help to identify how to
place any object on top of a surface in a stable way (by
identifying the “bottom” of the object). Ultimately, and the main
motivation for this work, it provides important cues about
the 3D shape of a novel object and may allow bypassing
the existing need for ground truth keypoints in training data
for state-of-the-art class-specific object reconstruction
systems [21, 38] - we will present a proof of concept for this
in Section 4.

Figure 2. Overview of our approach (panels: Annotated Object Classes; Pose Hypotheses for Novel Instances; Joint Reasoning over Instances). We first induce pose hypotheses for novel object instances using a system trained over aligned
annotated classes (Section 2). We then reason jointly over all instances of the novel object class to improve our pose predictions (Section 3).

Related Work. The problem of generalizing from a few
examples [34] was already studied in ancient Greece and
has become known as induction. Early induction work in
computer vision pursued feature sharing between different
classes [1, 35]. One-Shot and Zero-Shot learning [14, 26]
also represent related areas of research where the task is
to learn to predict labels from very few exemplars. Our
work differs from these approaches in that the few examples
we consider correspond to a small set of
annotated object categories. In this sense, our approach is
perhaps closer in style to attributes [13, 22], which explicitly
learn classifiers that are transversal to object classes and
can hence be trained on a subset of object classes. Unlike
these, our “attributes” correspond to a dense discretization
of the viewpoint manifold that implicitly aligns the shapes
of all training object classes. Another relevant recent work,
LSDA [18], learns object detectors using a seed set of classes
having bounding box annotations. Unlike our work, they
leverage available data for a related task (classification) and
frame the task as adapting classifiers to object detectors.

Pose estimation is crucial for developing a rich
understanding of objects and is therefore an important
component of systems for 3D reconstruction [21, 5],
recognition [25, 33], robotics [30] and human computer interaction
[24, 29]. Traditional approaches to object pose estimation
predicted instance pose in context of a corresponding shape
model [19]. The task has recently evolved to the prediction
of category-level pose, a problem targeted by many recent
methods [36, 28, 16]. Motivated by Palmer’s experiments
which demonstrate common canonical frames for similar
categories [27], we reason over cross-category pose - our
work can be thought of as a natural extension in the current
paradigm shift of pose prediction from instances/models to
categories.

2. Pose Induction for Object Instances 

We noted earlier that humans have the ability to infer rich
representations, including pose, even for previously unseen
object classes. These observations demonstrate the
applicability of human inductive learning as a mechanism to infer
desired representations for new visual data. We explore the
possibility of applying such ideas to induce the notion
of pose for previously unseen object instances. More
concretely, we assume pose annotations for some object classes
and aim to infer pose for an object instance belonging to a
different object category. We describe our formulation and
approach below.

2.1. Formulation 

Let $C$ denote the set of object categories with available
pose annotations. We follow the pose estimation formulation
of Tulsiani and Malik [36] who characterize pose via
$N_a = 3$ Euler angles - azimuth ($\phi$), elevation ($\varphi$) and
cyclo-rotation ($\psi$). We discretize the space of each angle into $N_\theta$
disjoint bins and frame the task of pose prediction as a
classification problem to determine the angular bin for each
Euler angle. Let $\{x_i \mid i = 1, \ldots, N\}$ denote the set of annotated
instances, each with its object class $c_i \in C$ and pose
annotations $(\phi_i, \varphi_i, \psi_i)$. The pose induction task is to predict
the pose for a novel instance $x$ whose object class $c \notin C$.
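
As a minimal sketch of the discretization step, the snippet below maps each Euler angle to one of $N_\theta$ equal-width bins and back to a bin centre. The function names, the choice of the range $[0, 2\pi)$ and the equal-width binning are assumptions made for illustration; the formulation above only states that each angle is split into disjoint bins.

```python
import math
from typing import Tuple


def angle_to_bin(angle: float, n_bins: int) -> int:
    """Map an angle (radians) to one of n_bins equal-width bins over [0, 2*pi)."""
    wrapped = angle % (2.0 * math.pi)          # wrap negative / large angles
    index = int(wrapped / (2.0 * math.pi) * n_bins)
    return min(index, n_bins - 1)              # guard against rounding up to n_bins


def bin_to_angle(bin_index: int, n_bins: int) -> float:
    """Return the centre of a bin, in radians in [0, 2*pi)."""
    return (bin_index + 0.5) * 2.0 * math.pi / n_bins


def pose_to_labels(azimuth: float, elevation: float,
                   cyclo_rotation: float, n_bins: int) -> Tuple[int, ...]:
    """Turn one pose annotation into N_a = 3 classification labels."""
    return tuple(angle_to_bin(a, n_bins)
                 for a in (azimuth, elevation, cyclo_rotation))
```

With this (assumed) layout, a predicted bin can be converted back to an angle with an error of at most half a bin width.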

2.2. Approach 

We examine two different approaches for inducing pose
for a novel instance - 1) the baseline approach of explicitly
leveraging the inference mechanism for similar object
classes and 2) our proposed approach of forcing the
inference mechanism to implicitly leverage similarities between
object classes, thereby allowing inference to generalize
to novel instances.

Similar Classifier Transfer (SCT). We first describe the
baseline approach which infers pose for instances of an
unannotated class by explicitly using similarity to some
annotated object category and obtaining predictions using a
system trained for a visually similar class. To obtain a pose
prediction system for the annotated classes $C$, we follow
the methodology of Tulsiani and Malik [36] and train a
VGG net [31] based Convolutional Neural Network (CNN)
[15, 23] architecture with $|C| \times N_a \times N_\theta$ output units in the
last layer. Each output unit corresponds to a particular
object class, Euler angle and angular bin - this CNN system
shares most parameters across classes but has some
class-specific parameters and disjoint output units. Let $f(x; W_c)$
denote the pose prediction function for image $x$ and
class-specific CNN weights $W_c$; then $f(x_i; W_{c_i})$ computes the
probability distribution over angular bins for instance $i$ - the
CNN is trained to minimize the softmax loss corresponding
to the true pose label $(\phi_i, \varphi_i, \psi_i)$ and $f(x_i; W_{c_i})$.

To predict pose for an instance $x$ with class $c \notin C$, this
approach uses the prediction system for a visually similar
class $c'$. We obtain the probability distribution over
angular bins for this instance by computing $f(x; W_{c'})$. We then
use the most likely hypothesis under this distribution as our
pose estimate for the instance $x$.

Generalized Classifier (GC). To infer properties for a
novel instance, our proposed approach is to rely not only
on the most similar visual object class, but also on general
abstractions from all visual data - seeing a sheep for the first
time, one would not just use knowledge of a specific class
like cows, but also generic knowledge about four-legged
animals. For example, the concept that pose of animals can be
determined using generic part representations (head, torso
etc.) can be learned if the annotations share a common
canonical reference frame across classes and this notion can
then be applied to novel related classes. These observations
motivate us to consider an alternate approach, termed
Generalized Classifier (GC), where we train a system that
exploits consistent visual similarities across object classes
that coherently change with the pose label. This approach
not only bypasses the need for manually assigning a visually
similar class, it can also potentially learn abstractions more
generalizable to unseen data and therefore handle novel
instances more robustly.

Concretely, we first obtain pose annotations across
object classes with respect to a common canonical frame (details
described in the experimental section) and train a
category-agnostic pose estimation system. This implicitly forces
the CNN based pose estimation system to exploit
similarities across object classes and learn common representations
that may be useful to predict pose across object classes.
We train a VGG net [31] based CNN architecture with
$N_a \times N_\theta$ output units in the last layer - the units
corresponding to a particular Euler angle and angular bin are shared
across all classes. Let $f(x; W)$ denote the pose prediction
function for image $x$ and CNN weights $W$; the CNN is
trained to minimize the softmax loss corresponding to the
true pose label $(\phi_i, \varphi_i, \psi_i)$ and $f(x_i; W)$. To predict pose
for an instance $x$ of an unannotated class $c$, we just
compute $f(x; W)$ - the alignment of all annotated classes to a
canonical pose and implicit sharing of abstractions allow
this system to generalize well to new object classes.
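
To make the shared output layer concrete, the following sketch decodes a flat vector of $N_a \times N_\theta$ output scores into one bin per Euler angle and evaluates the corresponding softmax loss. It is an illustration only: the angle-major layout of the output units and the function names are assumptions, and the network producing the scores is not modelled.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over one group of angular bins."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def decode_pose(outputs: Sequence[float], n_angles: int, n_bins: int) -> List[int]:
    """Return the most likely bin for each angle (assumed angle-major layout)."""
    assert len(outputs) == n_angles * n_bins
    bins = []
    for a in range(n_angles):
        probs = softmax(outputs[a * n_bins:(a + 1) * n_bins])
        bins.append(max(range(n_bins), key=probs.__getitem__))
    return bins


def softmax_loss(outputs: Sequence[float], true_bins: Sequence[int],
                 n_bins: int) -> float:
    """Sum over angles of the negative log-probability of the true bin."""
    loss = 0.0
    for a, true_bin in enumerate(true_bins):
        probs = softmax(outputs[a * n_bins:(a + 1) * n_bins])
        loss -= math.log(probs[true_bin])
    return loss
```

Because the output units carry no class index, the same decoding applies unchanged to an instance of an unannotated class.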

2.3. Experiments 

Pose Annotations and Alignment. We evaluate the
performance of our system on PASCAL VOC [11] object
categories. We obtain pose annotations for rigid categories via
the PASCAL3D+ [39] dataset which annotates instances in
the PASCAL VOC and Imagenet datasets with their Euler angles.
The notion of a global viewpoint is challenging to define
for various animal categories in PASCAL VOC and we
apply SfM-based techniques on ground truth keypoints to
obtain the torso pose. We use keypoint annotations provided
by Bourdev et al. [3] followed by rigid factorization [38]
to obtain viewpoint for non-rigid PASCAL classes. The
PASCAL3D+ annotations assume a canonical reference frame
across classes - objects are laterally symmetric across the X axis
and face frontally in the canonical pose. We obtain similarly
aligned reference frames for other object classes by aligning
the SfM models to adhere to this constraint.

Evaluation Setup. We held out pose annotations for four
object classes - bus, dog, motorbike and sheep. We then
finetuned the CNN systems corresponding to the two
approaches described above, after initializing weights using
a model pretrained for Imagenet [9] classification, using pose
annotations for the remaining 16 classes obtained via
PASCAL3D+ or PASCAL VOC keypoint labels.

To evaluate the performance of our system for rigid
objects, we used the $Acc_\theta$ metric [36] which measures the
fraction of instances whose predicted viewpoint is within
a fixed threshold of the correct viewpoint (we use $\theta = \pi/6$).
The ‘ground-truth’ viewpoint obtained for some classes via
SfM techniques is often noisy and the above metric, which
works well for exact annotations, needs to be altered. To
evaluate the system’s performance for these classes, we use
an auxiliary task of predicting the ‘frontal/left/right/rear-facing’
label available in PASCAL VOC for these objects.
We use our predicted azimuth for these objects and infer the
‘frontal/left/right/rear-facing’ label based on the predicted
azimuth. We denote the metric that measures accuracy at
this auxiliary task as $Acc_y$.
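
The two metrics can be sketched as follows. The sketch assumes that viewpoint error is measured by the geodesic distance between rotation matrices (the distance used later in Section 3), and it assumes a particular mapping from azimuth to the four facing labels: quadrants centred on 0 (frontal), $\pi/2$ (left), $\pi$ (rear) and $-\pi/2$ (right). Neither the quadrant boundaries nor the function names come from the text above.

```python
import math
from typing import List, Sequence

Matrix = List[List[float]]


def rot_z(angle: float) -> Matrix:
    """Rotation about the z axis, used here only to build example inputs."""
    c, s = math.cos(angle), math.sin(angle)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


def geodesic_distance(r1: Matrix, r2: Matrix) -> float:
    """Rotation angle of R1^T R2, equal to ||log(R1^T R2)||_F / sqrt(2)."""
    # trace(R1^T R2) is the sum of element-wise products of R1 and R2.
    trace = sum(r1[i][j] * r2[i][j] for i in range(3) for j in range(3))
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))  # clamp rounding noise
    return math.acos(cos_angle)


def accuracy_at_threshold(predicted: Sequence[Matrix],
                          ground_truth: Sequence[Matrix],
                          threshold: float = math.pi / 6) -> float:
    """Fraction of instances whose viewpoint error is below the threshold."""
    hits = sum(1 for p, g in zip(predicted, ground_truth)
               if geodesic_distance(p, g) < threshold)
    return hits / len(predicted)


def facing_label(azimuth: float) -> str:
    """Assumed azimuth-to-label mapping; boundaries at odd multiples of pi/4."""
    wrapped = (azimuth + math.pi / 4) % (2.0 * math.pi)
    quadrant = int(wrapped // (math.pi / 2)) % 4
    return ("frontal", "left", "rear", "right")[quadrant]
```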

Results. We report the performance of our baseline and
proposed approach in Table 1. For the SCT method, we used
the weights from car, bicycle, cat and cow prediction
systems to predict pose for bus, motorbike, dog and sheep
respectively since these correspond to the visually most
similar classes with available annotations. We note that the
predictions using both approaches are often very close to the
actual object pose and are significantly better than chance.
We also observe that training a generalized prediction
system is better than explicitly using a similar class (except for
motorbike, where the bicycle class is very similar). This
is perhaps because sharing of parameters and output units
across classes enables learning shared abstractions that
generalize better to novel classes.


            $Acc_{\pi/6}$         $Acc_y$
Approach    bus     mbike     dog     sheep
SCT         0.50    0.58      0.75    0.58
GC          0.80    0.55      0.74    0.78

Table 1. Performance of our approaches for various novel object
classes.

We have described a methodology that aims to provide a
richer description, in particular pose, given a single instance
belonging to a novel class. We note that though human
levels of precision and understanding for novel objects are
still far away, the results imply that we can reliably predict
pose without requiring training annotations for the novel
class, which is a step in the direction of visual systems
capable of dealing with new instances.

Importance of Similar Object Categories. To further gain
insight into our prediction system, we focused on the ‘bus’
object category and trained two additional networks for the
GC method by holding out ‘car’ and ‘chair’ respectively (in
addition to the four held out categories above). In
comparison to $Acc_{\pi/6} = 0.80$, the $Acc_{\pi/6}$ measure for bus in these two
cases was 0.73 and 0.81 respectively. The observed drop by
holding out ‘car’ confirms our intuition regarding the
importance of similar object categories in the seed set.

3. Pose Induction for Object Categories 

When reasoning over a single instance of a novel
category, any system, including the approaches in Section 2,
can only rely on inference and abstractions from previously
seen visual data. However, if given at once a collection
of instances belonging to the new category, we can go
beyond isolated reasoning for each instance and leverage the
collection of images to jointly reason over and infer pose
for all instances of the object class under
consideration. Tackling the problem of inducing pose at a
category level is particularly relevant as pose annotations
for objects are far more tedious to collect than class labels -
there are significantly more datasets with annotated classes
than pose. Our method allows us to augment these available
datasets with a notion of pose for each object. Our method
can also be used in a completely unsupervised setting to
infer pose for consistent visual clusters over instances that
visual knowledge extraction systems like NEIL [ ]
automatically discover.

One possible approach to reasoning jointly is to
explicitly infer intra-class correspondences, predict relative
transformations and augment these with the induced instance
predictions to obtain more informed pose estimates for each
instance. However, the task of discovering correspondences
across instances that differ in both pose and appearance is
a particularly challenging one and has been demonstrated
only in limited pose and appearance variability [40, 32].
Our proposed approach provides a simpler but more robust
way of leveraging the image collection. We build on the
intuition that instances with similar spatial distributions of
parts are close on the pose manifold. We define a similarity
measure that captures this intuition and encourage similar
instances to have similar pose predictions.


Algorithm 1 Joint Pose Induction
Initialization
for i in test instances do
    Predict pose distribution $F(x_i; W)$ (Section 2)
    Compute $K$ pose hypotheses and likelihood scores
        $\{(R_{ik}, \beta_{ik}) \mid k \in \{1, \ldots, K\}\}$ using $F(x_i; W)$
    Compute similar instances $N_i$ using $F_i$ (eq 1)
    $z_i \leftarrow \arg\max_k \beta_{ik}$
end for
Pose Refinement
$\forall i$, Update $z_i$ (eq 6) until convergence


3.1. Approach 

We first obtain multiple pose hypotheses for each
instance by obtaining a diverse set of modes from the
distribution predicted by the system described in Section 2. We
then frame the joint pose prediction task as that of selecting
a hypothesis for each instance while taking into
consideration the prediction confidence score as well as pose
consistency with similar instances. We describe our formulation
in detail below.

Instance Similarity. For each instance $i$, we obtain a set of
instances $N_i$ whose feature representations are similar to
instance $i$. Our feature representation for an instance is
motivated by the observation that each channel in a higher layer
of a CNN can be reasoned as encoding a spatial likelihood
of abstract parts. Let $C_i(x, y, k)$ denote the instance’s
convolutional feature response for channel $k$ at location $(x, y)$;
our feature representation $F_i$ is as follows.

$$F_i(\cdot, \cdot, k) = \frac{\sigma(C_i(\cdot, \cdot, k))}{\|\sigma(C_i(\cdot, \cdot, k))\|_1} \qquad (1)$$

The above, where $\sigma(\cdot)$ represents a sigmoid function,
encodes each instance via the normalized spatial likelihood of
these ‘parts’. We use histogram intersection over these
representations as a similarity measure between two instances
and obtain the set of neighbors $N_i$ for each instance.

Figure 3. Viewpoint predictions for unoccluded groundtruth instances using our full system (‘GC+sim’). The columns show 15th, 30th,
45th, 60th and 75th percentile instances respectively in terms of the error. We visualize the predictions by rendering a 3D model using our
predicted viewpoint.
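
The feature representation of eq 1 and the histogram-intersection similarity can be sketched in a few lines. This is an illustrative sketch, not the released implementation: feature maps are plain nested lists indexed as `conv[k][y][x]`, and the number of neighbours kept per instance (`n_neighbours`) is a hypothetical parameter, since the size of $N_i$ is not specified in the text.

```python
import math
from typing import List, Sequence

FeatureMaps = Sequence[Sequence[Sequence[float]]]  # conv[k][y][x]


def sigmoid(value: float) -> float:
    return 1.0 / (1.0 + math.exp(-value))


def part_likelihood_features(conv: FeatureMaps) -> List[float]:
    """Eq 1: per-channel sigmoid responses, L1-normalized over locations."""
    features: List[float] = []
    for channel in conv:
        squashed = [sigmoid(v) for row in channel for v in row]
        norm = sum(squashed)  # L1 norm; sigmoid outputs are all positive
        features.extend(v / norm for v in squashed)
    return features


def histogram_intersection(f1: Sequence[float], f2: Sequence[float]) -> float:
    """Similarity between two instances; larger means more similar."""
    return sum(min(a, b) for a, b in zip(f1, f2))


def nearest_neighbours(features: Sequence[Sequence[float]], i: int,
                       n_neighbours: int) -> List[int]:
    """Indices of the n_neighbours instances most similar to instance i."""
    scored = [(histogram_intersection(features[i], f), j)
              for j, f in enumerate(features) if j != i]
    scored.sort(reverse=True)
    return [j for _, j in scored[:n_neighbours]]
```

Since each channel is normalized to sum to one, the intersection of a feature vector with itself equals the number of channels, which gives a natural upper bound on the similarity.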

Unaries. For each instance $i$, we obtain $K$ distinct pose
hypotheses $\{R_{ik} \mid k \in \{1, \ldots, K\}\}$ along with the
corresponding log-likelihood scores $\beta_{ik}$. By $Z_i \in \{1, \ldots, K\}$, we
denote the random variable which corresponds to the pose
hypothesis we select for instance $i$. The log-likelihood scores
for each pose hypothesis act as the unary likelihood terms.

$$P_u(Z_i = z_i) \propto e^{\beta_{i z_i}} \qquad (2)$$

Pose Consistency. Let $\Delta(R_1, R_2) = \frac{\|\log(R_1^T R_2)\|_F}{\sqrt{2}}$ denote
the geodesic distance between rotation matrices $R_1, R_2$ and
$I$ denote the indicator function. We model the consistency
likelihood term as the fraction of instances in $N_i$ with a
similar pose.

$$P_c(Z_i = z_i) \propto \frac{\sum_{j \in N_i} I(\Delta(R_{i z_i}, R_{j z_j}) < \delta)}{|N_i|} \qquad (3)$$

While this formulation encourages similar pose estimates
for neighbors, it is biased towards more ‘popular’ pose
estimates (if the dataset has more front facing bikes, it is more
likely to find neighbors for the corresponding pose
hypothesis). Motivated by the recent work of Isola et al. [20], who
use Pointwise Mutual Information [12] (PMI) to counter
similar biases, we normalize by the likelihood of randomly
finding similar pose estimates for neighbors to yield -

$$P_c(Z_i = z_i) \propto \frac{\sum_{j \in N_i} I(\Delta(R_{i z_i}, R_{j z_j}) < \delta)}{\sum_{j} I(\Delta(R_{i z_i}, R_{j z_j}) < \delta)} \qquad (4)$$

Formulation. $P_u$ favors the pose hypotheses that are
scored higher by the instance pose induction system and
$P_c$, weighted by a factor of $\lambda$, leads to a higher joint
probability if the predicted pose is consistent with the pose for similar
instances. We finally combine these two likelihood terms
to model the likelihood for the pose hypotheses for a given
instance.

$$P(Z_i = z_i) \propto P_u(z_i) \, P_c(z_i)^{\lambda} \qquad (5)$$

Inference. We aim to infer the MAP estimates $z_i^*$ for all
instances to give us a pose prediction via joint reasoning over
all instances. We use iterative updates and at each step, we
condition on all the unknown variables except a particular
$Z_i$. The update for assignment $z_i$ is as follows -

$$z_i^* = \arg\max_k \Big( \beta_{ik} + \lambda \log \sum_{j \in N_i} I(\Delta(R_{ik}, R_{j z_j}) < \delta) - \lambda \log \sum_{j} I(\Delta(R_{ik}, R_{j z_j}) < \delta) \Big) \qquad (6)$$

Our overall method, as summarized in Algorithm 1,
computes pose estimates for every instance of a novel object
class given a large collection of instances.
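
The iterative update of eq 6 can be sketched as a coordinate-wise refinement loop. To keep the example short and self-contained, each pose hypothesis is a single azimuth angle and `circular_distance` stands in for the geodesic distance $\Delta$; both are simplifications. Two further choices are assumptions rather than part of the formulation: the instance being updated is excluded from the sum over all instances, and a small `eps` is added inside the logarithms so that an empty count does not produce $\log 0$.

```python
import math
from typing import Callable, List, Sequence


def circular_distance(a: float, b: float) -> float:
    """Stand-in for the geodesic distance when a pose is a single angle."""
    d = abs(a - b) % (2.0 * math.pi)
    return min(d, 2.0 * math.pi - d)


def joint_pose_induction(
    hypotheses: Sequence[Sequence[float]],   # hypotheses[i][k]: k-th pose of instance i
    scores: Sequence[Sequence[float]],       # scores[i][k]: log-likelihood beta_ik
    neighbours: Sequence[Sequence[int]],     # neighbours[i]: indices in N_i (without i)
    lam: float,
    delta: float,
    distance: Callable[[float, float], float] = circular_distance,
    eps: float = 1e-3,                       # assumed smoothing, avoids log(0)
    max_iters: int = 20,
) -> List[int]:
    n = len(hypotheses)
    # Initialization: pick the highest-scoring hypothesis for every instance.
    z = [max(range(len(s)), key=s.__getitem__) for s in scores]
    for _ in range(max_iters):
        changed = False
        for i in range(n):
            best_k, best_value = z[i], -math.inf
            for k, pose in enumerate(hypotheses[i]):
                # Which instances currently hold a pose close to hypothesis k?
                agrees = [distance(pose, hypotheses[j][z[j]]) < delta
                          for j in range(n)]
                near = sum(agrees[j] for j in neighbours[i])
                overall = sum(agrees[j] for j in range(n) if j != i)
                value = scores[i][k] + lam * (
                    math.log(near + eps) - math.log(overall + eps))
                if value > best_value:
                    best_k, best_value = k, value
            if best_k != z[i]:
                z[i], changed = best_k, True
        if not changed:  # converged: a full sweep changed nothing
            break
    return z
```

In the usage below, instance 0 initially prefers a flipped hypothesis by a small margin, but its two neighbours agree on the other hypothesis, so the refinement switches it.

```python
hypotheses = [[0.0, math.pi], [0.05], [-0.05], [math.pi]]
scores = [[-1.0, -0.9], [0.0], [0.0], [0.0]]
neighbours = [[1, 2], [0, 2], [0, 1], [0]]
selected = joint_pose_induction(hypotheses, scores, neighbours, lam=1.0, delta=0.5)
```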

3.2. Experiments 

The aim of the experiments is twofold - 1) to
demonstrate the benefits of jointly reasoning over all instances of
the class and 2) to show that a spatial feature
representation capturing abstract parts, as defined in eq 1, yields better
performance than alternatives for improving pose estimates.
We follow the experimental setup previously described in
Section 2.3 and build on the ‘GC’ approach. Our method
using spatial features (from Conv5 of VGG net) is denoted
as ‘GC+c5’ and the alternate similarity representation
using fc7 features from VGG net is denoted as ‘GC+fc7’. We
visualize the performance of our system in Figure 3 where
the columns show 15th to 75th percentile instances, when
sorted in terms of error. We observe that the predictions are
accurate even around the 60th to 75th percentile regime.
Figure 4. Viewpoint predictions for novel object classes without any pose annotations. The columns show randomly selected instances
whose azimuth is predicted to be around $-\pi/2$ (right-facing), 0 (front-facing), $\pi/2$ (left-facing) respectively.

Figure 5. Failure modes. Our method is unable to induce pose for object classes which drastically differ from
the annotated seed set. The columns show randomly selected instances whose azimuth is predicted to be around
$-\pi/2$ (right-facing), 0 (front-facing), $\pi/4$, $\pi/2$ (left-facing) respectively.


            $Acc_{\pi/6}$         $Acc_y$
Approach    bus     mbike     dog     sheep
GC          0.80    0.55      0.74    0.77
GC+fc7      0.76    0.51      0.73    0.75
GC+c5       0.86    0.60      0.74    0.79

Table 2. Joint reasoning for Pose Induction.


            $Acc_{\pi/6}$         $Acc_y$
Setting     bus     mbike     dog     sheep
All         0.86    0.60      0.74    0.79
Confident   0.97    0.76      0.89    0.90

Table 3. Performance for Confident Predictions.


We see that the results in Table 2 clearly support our two 
main hypotheses - that given multiple instances of a novel 
category, jointly reasoning over all of them improves the 
induced pose estimates and that the feature representation 
we described further improves performance. 

An additional result that we show in Table 3 is that if we
rank the predictions by confidence (eq. 5) and take the most
confident third of the predictions, error rates are significantly
reduced. This means that the pose induction system has the
desirable property of having low confidence when it fails.
As we demonstrate later, for various applications e.g. shape
model learning, we might only need accurate pose estimates
for a subset of instances and this result allows us to
automatically find that subset by selecting the top few in terms of
confidence.
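
Selecting the confident subset amounts to a sort followed by a cut-off. The sketch below assumes that a scalar confidence is available per instance (for example, the logarithm of eq. 5 evaluated at the selected hypothesis); the function name and the `keep_one_in` parameter are hypothetical.

```python
from typing import List, Sequence


def most_confident(confidences: Sequence[float], keep_one_in: int = 3) -> List[int]:
    """Indices of the top 1/keep_one_in instances by confidence, in index order."""
    n_keep = max(1, len(confidences) // keep_one_in)
    ranked = sorted(range(len(confidences)),
                    key=lambda i: confidences[i], reverse=True)
    return sorted(ranked[:n_keep])
```

For six instances with confidences `[0.1, 0.9, 0.5, 0.3, 0.8, 0.2]`, the default setting keeps the two instances at indices 1 and 4.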

3.3. Qualitative Results 

The evaluation setup so far has focused on PASCAL
VOC object classes because of readily available annotations
to measure performance. However, the aim of our method
is to be able to infer pose annotations for any novel object
category. We can qualitatively demonstrate the applicability
of our approach for diverse classes using the Imagenet
object classes. Figure 4 shows the predictions of our method
for several classes for which we do not use any pose
annotations (we use randomly selected instances from the top
third, in terms of prediction confidence, to visualize the
predictions in Figure 4). It is clear that the system performs
well on animals in general as well as for other classes
related to the initial training set (e.g. golfcart, motorbike).
While we can often infer a meaningful representation of
pose even for some classes rather different from the initial
training classes, e.g. hammer, object categories which differ
drastically from the annotated seed set (e.g. jellyfish,
vacuum cleaner) are the principal failure modes as illustrated
in Figure 5.

4. Shape Modelling for Novel Object Classes 

Acquiring shape models for generic object categories
is an integral component of perceiving scenes with a rich
3D representation. The conventional approach to acquiring
shape models includes leveraging human experts to build
3D CAD models of various shapes. This approach,
however, cannot scale to a large number of classes while
capturing the wildly different shapes in each object class.
Learning based approaches which also allow shape deformations
[2] provide an alternative solution but typically rely on some
3D initialization [6]. Kar et al. [21] recently showed that
these models can be learned using annotations for only
object silhouettes and a set of keypoints. These
requirements, while an improvement over previous approaches, are
still prohibitive for deploying similar approaches on a large
scale. Enabling such approaches to learn shape models in
the wild - given nothing but a set of instances - is an
important endeavor as it would allow us to scale shape model
acquisition to a large set of objects.

We take a step towards this goal using our pose induction
system - we demonstrate that it is possible to learn shape
models for a novel object category using just object
silhouette annotations. We build on the formulation by Kar et al.
[21] and note that they mainly used keypoint annotations to
estimate camera projection parameters and that these can be
initialized using our induced pose as well. We briefly review
their formulation and describe our modifications that allow
us to learn shape models without keypoint annotations.

Figure 6. Mean shape models learnt for motorbike using a) top:
all pose induction estimates b) mid: most confident pose induction
estimates c) bottom: ground-truth keypoint annotations.

Formulation. Let $P_i = (R_i, c_i, t_i)$ represent the
projection parameters (rotation, scale and translation) for the
$i$-th instance. Kar et al. obtain these using the annotated
keypoints and we instead initialize the scale and translation
parameters using the bounding box scale and location, and the
rotation using our induced pose. Their shape model $M = (\bar{S}, V)$
consists of a mean shape $\bar{S}$ and linear deformation bases
$V = \{V_1, \ldots, V_K\}$. The energies used in their formulation
enforce that the shape for an instance is consistent with its
silhouette ($E_s$, $E_c$), shapes are locally consistent ($E_l$),
normals vary smoothly ($E_n$) and the deformation parameters
are small ($\|\alpha_{ik} V_k\|_F^2$) (they also use a keypoint based
energy $E_{kp}$ which we ignore). We refer the reader to [21]
for details regarding the optimization and formulations of
shape energies. While Kar et al. only optimize over shape
model and deformation parameters, we note that since our
projection parameters are noisy, we should also refine them
to minimize the energy. Therefore, we minimize the
objective mentioned in eq. 7 over the shape model, deformation
parameters as well as projection parameters (initialized
using the induced pose) to learn shape models of a novel
object class using just silhouette annotations.

$$\min_{\bar{S}, V, \alpha, P} \; E_l(\bar{S}, V) + \sum_i \Big( E_s^i + E_c^i + E_n^i + \sum_k \|\alpha_{ik} V_k\|_F^2 \Big) \quad \text{subject to: } S^i = \bar{S} + \sum_k \alpha_{ik} V_k \qquad (7)$$
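
The initialization of the projection parameters can be illustrated with a short sketch. The text only states that scale and translation come from the bounding box and rotation from the induced pose; the specific choices below (scale as half the larger box side, translation as the box centre, and the `ProjectionParams` container) are assumptions made for illustration and are not taken from Kar et al. or from the released code.

```python
from dataclasses import dataclass
from typing import List, Tuple

Matrix = List[List[float]]


@dataclass
class ProjectionParams:
    rotation: Matrix                   # R_i, taken from the induced pose
    scale: float                       # c_i, from the bounding box size
    translation: Tuple[float, float]   # t_i, from the bounding box location


def init_projection(bbox: Tuple[float, float, float, float],
                    induced_rotation: Matrix) -> ProjectionParams:
    """Initialize (R_i, c_i, t_i) from a box (x_min, y_min, x_max, y_max)."""
    x_min, y_min, x_max, y_max = bbox
    width, height = x_max - x_min, y_max - y_min
    scale = max(width, height) / 2.0   # assumed: shape roughly spans [-1, 1]
    centre = ((x_min + x_max) / 2.0, (y_min + y_max) / 2.0)
    return ProjectionParams(induced_rotation, scale, centre)
```

These values serve only as a starting point: since the objective is also minimized over the projection parameters, noisy initial estimates can be refined during optimization.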

Results. We use the unoccluded instances of the class
motorbike to demonstrate the applicability of our pose
induction system for shape learning. Since we are interested in
learning a shape model for the class, we can ignore some
object instances for which we are uncertain regarding pose.
As shown in Table 3, we can use the subset of most confident
pose estimates to get a higher level of precision. Figure 6
shows that our model learnt without any keypoint
annotation is quite similar to the model learnt by Kar et al. using
full annotations and that using the subset of instances with
confident pose induction predictions substantially improves
shape models. The learnt model demonstrates that our pose
induction system makes it feasible to learn shape models
for novel object classes without requiring keypoint
annotations. This not only qualitatively verifies the reliability of
our pose induction estimates, it also signifies an important
step towards automatically learning shape representations
from images.

5. Conclusion 

We have presented a system which leverages available
pose annotations for a small set of seed classes and can
induce pose for a novel object class. We have empirically
shown that the system performs well given a single instance
of a novel class and that this performance is significantly
improved if we reason jointly over multiple instances of
that class, when available. We have also shown that our
pose induction system enables learning shape
representations for object classes without any keypoint/3D
annotations required by previous methods. Our qualitative results
on Imagenet further demonstrate that this approach
generalizes to a large and diverse set of object classes.